knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
library(lubridate)
library(tidyverse)
library(plotly)
library(dplyr)
library(ggplot2)
library(hrbrthemes)
library(viridis)
library(MASS)
library(reshape2)
library(reshape)
library(ggmap)
library(leaflet)
library(stringr)
library(wordcloud)
stop <- read.csv("stopdata.csv", sep = ",")
stop$date <- ymd_hms(stop$stop_datetime)
stop$doj_record_id <- as.factor(stop$doj_record_id)
stop <- stop |>
group_by(unique_identifier) |>
mutate(count = n())Police Department Stop Data
This document is a product of the final project for the STAT 570 lecture, focusing on data handling and visualization tools. It is essential to acknowledge that minor errors may be present, and the methods employed may not necessarily reflect the optimal approach related to the data set.
Author:
Buse Bakış - buse.bakis@metu.edu.tr
The San Francisco Police Department (SFPD) Stop Data was designed to capture information to comply with the Racial and Identity Profiling Act (RIPA), or California Assembly Bill (AB)953. SFPD officers collect specific information on each stop, including elements of the stop, circumstances and the perceived identity characteristics of the individual(s) stopped. The information obtained by officers is reported to the California Department of Justice. This data set includes data on stops starting on July 1st, 2018, which is when the data collection program went into effect. Also, the data set includes information about police stops that occurred, including some details about the person(s) stopped, and what happened during the stop. Each row is a person stopped with a record identifier for the stop and a unique identifier for the person. A single stop may involve multiple people and may produce more than one associated unique identifier for the same record identifier.
Introduction
In this project to investigate police stop data and traffic accident that caused a person injury in San Francisco data is used. One of the data sets is between in 2018-2023 (Police Stop Data), and the other one is between in 2005-2023 (Accident data). However, to be able to work in the same period, it was selected as 2018-2023.
There is a 244934 observation and 87 variable in Police Stop data, and 57456 observation and 57 variable in Accident data.
Firstly, the structure of the Police Stop data was investigated. Only numeric variables are numeric and others even if date variable was also character variable, date variable changed as date with the function from lubridate package, and the monthly trends of number of stops is check with the interactive plotly line plot.
There was a huge dramatic decrease in February 2020, probably pandemic situation is affected the rate of the stop people to be safe from Covid with less interaction. May be police officer just give the penalty tickets without stopping.
stop_monthly <- stop |>
group_by(date = lubridate::floor_date(date, 'month')) |>
summarize(count = sum(count))
fig <- plot_ly(stop_monthly, x = ~date, y = ~count, type = 'scatter', mode = 'lines') |>
layout(title = 'Trend of number of people stoped by police 2018-2023', plot_bgcolor = "#e5ecf6", xaxis = list(title = 'Date'),
yaxis = list(title = 'Count of stop'))
figThis data set comprises numerous records of police stops, yet some of them have undergone status changes, specifically being marked as “deleted.” For this project, only data with the status “Completed - Successful Submission” has been included.
stop_age <- stop |>
group_by(gender = as.factor(perceived_gender), age = perceived_age) |>
summarize(count = sum(count))`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
ggplot(stop_age, aes(x = age , y = count, colour = gender )) +
geom_line()Warning: Removed 1 row containing missing values (`geom_line()`).
stop <- stop |>
filter(perceived_age < 80 && perceived_age > 16) |>
filter(stop_data_record_status == "Completed - Successful Submission")
factor_stop <- as.data.frame(unclass(stop), stringsAsFactors = TRUE)
stop_age2 <- stop |>
group_by(gender = as.factor(perceived_gender), age = as.factor(perceived_age_group)) |>
summarize(count =n()) |>
filter(gender %in% c("Male","Female"))`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
ggplot(stop_age2, aes(fill=gender, y=count, x=age)) +
geom_bar(position="dodge", stat="identity") +
scale_fill_viridis(discrete = T, option = "E") +
ggtitle("Gender vs Age Groups") +
facet_wrap(~gender) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
xlab("")Demographic Information
Certain demographic information has been examined, starting with the analysis of gender and age through a line plot. The most of them are male and their ages is in between 25 to 50. The most prevalent gender, among others, is female within the same age group. A minority group of transgender individuals is present in the data set. As evident from the plot, outliers in age are apparent. In California, the minimum age for obtaining a driver’s license is 16.5, yet the data starts from 1. Additionally, individuals above the age of 80 should not be eligible to drive, but there are values close to 125. To address this, the data has been filtered to include ages between 16 and 80.
summary(factor_stop$perceived_race_ethnicity) Asian Black/African American
27041 59846
Hispanic/Latino(a) Middle Eastern or South Asian
47937 16152
Multi-racial Native American
5989 356
Pacific Islander White
2959 84654
summary(factor_stop$perceived_age_group) 18 - 29 30 - 39 40 - 49 50 - 59 60 or over Under 18
64295 78355 51320 33524 16673 767
stop_eth <- stop |>
group_by(age = perceived_age_group,
ethnic = perceived_race_ethnicity) |>
summarize(count = sum(count))`summarise()` has grouped output by 'age'. You can override using the `.groups`
argument.
levels(as.factor(stop_eth$ethnic))[1] "Asian" "Black/African American"
[3] "Hispanic/Latino(a)" "Middle Eastern or South Asian"
[5] "Multi-racial" "Native American"
[7] "Pacific Islander" "White"
stop_eth <- stop_eth |>
mutate(count = case_when(ethnic == "Asian" ~ count*0.17,
ethnic == "Middle Eastern or South Asian"~ count*0.17,
ethnic == "Hispanic/Latino(a)"~ count*0.15,
ethnic == "White"~ count*0.43,
ethnic == "Black/African American"~ count*0.052,
ethnic == "Multi-racial"~ count*0.08,
ethnic == "Native American"~ count*0.005,
ethnic == "Pacific Islander"~ count*0.003)
)
ggplot(data=stop_eth, aes(x=age, y=count, fill=ethnic)) +
geom_bar(stat="identity") +
scale_fill_brewer(palette="Blues")After checking age and gender values, ethnicity of person is investigated. Across all age groups, individuals identified as white are most frequently stopped by police officers, which aligns with the fact that almost half of the population consists of white individuals. Interestingly, despite Hispanics making up 15% of the population, they exhibit a higher rate of police stops compared to other ethnicities.
Accident Data vs Police Stop Data Trend
To check if there is any connection between the stop rate and crash rate, we used Traffic Accident data collected from the San Francisco government open data library, and we adjusted the date values similar to the corrections made in the Police Stop data. Even if crash rate is less than the police stop rate, it has similar trend with the stop rate. At the beginning of pandemic, the crash rate sharply decreased but after a while it started increase again. Perhaps this trend is influenced by seasonality, as there is a noticeable decrease at the beginning of each year.
crash <- read.csv("crashes.csv", sep = ",")
head(crash) unique_id cnn_intrsctn_fkey cnn_sgmt_fkey case_id_pkey tb_latitude
1 4734 21703000 9150000 2008161 37.73103
2 44875 27577000 3578000 2238762 37.78552
3 39944 27694000 12281101 2116991 37.75097
4 10509 25668000 6921000 2183936 37.75558
5 18041 25876000 4008000 2382577 37.76849
6 44932 27567000 3937000 2409306 37.78735
tb_longitude geocode_source geocode_location collision_datetime
1 -122.4290 SFPD-CROSSROADS CITY STREET 04/09/2005 02:03:00 AM
2 -122.4604 SFPD-CROSSROADS CITY STREET 09/06/2005 10:30:00 AM
3 -122.4950 SFPD-CROSSROADS CITY STREET 06/23/2005 02:30:00 PM
4 -122.4296 SFPD-CROSSROADS CITY STREET 08/10/2005 03:26:00 PM
5 -122.4290 SFPD-CROSSROADS CITY STREET 11/03/2005 02:21:00 PM
6 -122.4570 SFPD-CROSSROADS CITY STREET 12/26/2005 08:03:00 AM
collision_date collision_time accident_year month day_of_week
1 2005 April 09 02:03:00 2005 April Saturday
2 2005 September 06 10:30:00 2005 September Tuesday
3 2005 June 23 14:30:00 2005 June Thursday
4 2005 August 10 15:26:00 2005 August Wednesday
5 2005 November 03 14:21:00 2005 November Thursday
6 2005 December 26 08:03:00 2005 December Monday
time_cat juris officer_id reporting_district beat_number
1 2:01 am to 6:00 am 3801 4216 004
2 10:01 am to 2:00 pm 3801 625 G 4B4G
3 2:01 pm to 6:00 pm 3801 1629 3I3
4 2:01 pm to 6:00 pm 3801 002122 D 4B6D
5 2:01 pm to 6:00 pm 3801 1105 3E61
6 6:01 am to 10:00 am 3801 334 RICHM 3G2
primary_rd secondary_rd distance direction weather_1 weather_2
1 MISSION ST TRUMBULL ST 132 North Cloudy Raining
2 CALIFORNIA ST 02ND AVE 37 West Cloudy Not Stated
3 SUNSET BLVD ORTEGA ST 211 South Cloudy Not Stated
4 HILL ST SANCHEZ ST 133 East Clear Not Stated
5 CHURCH ST RESERVOIR ST 66 South Clear Not Stated
6 JORDAN AVE CALIFORNIA ST 500 North Clear Not Stated
collision_severity type_of_collision mviw
1 Injury (Complaint of Pain) Hit Object Fixed Object
2 Injury (Severe) Other Bicycle
3 Injury (Complaint of Pain) Hit Object Fixed Object
4 Injury (Other Visible) Broadside Parked Motor Vehicle
5 Injury (Other Visible) Overturned Fixed Object
6 Injury (Complaint of Pain) Rear End Parked Motor Vehicle
ped_action road_surface road_cond_1 road_cond_2
1 No Pedestrian Involved Wet No Unusual Condition Not Stated
2 No Pedestrian Involved Dry No Unusual Condition Not Stated
3 No Pedestrian Involved Dry No Unusual Condition Not Stated
4 In Road, Including Shoulder Dry No Unusual Condition Not Stated
5 No Pedestrian Involved Dry Holes, Deep Ruts Not Stated
6 No Pedestrian Involved Wet No Unusual Condition Not Stated
lighting control_device intersection vz_pcf_code vz_pcf_group
1 Dark - Street Lights None Midblock > 20ft Unknown Unknown
2 Daylight None Midblock > 20ft 22517 22517
3 Daylight Functioning Midblock > 20ft Unknown Unknown
4 Daylight None Midblock > 20ft Unknown Unknown
5 Daylight None Midblock > 20ft Unknown Unknown
6 Daylight None Midblock > 20ft Unknown Unknown
vz_pcf_description
1 Unknown
2 Opening door on traffic side when unsafe
3 Unknown
4 Unknown
5 Unknown
6 Unknown
vz_pcf_link
1 Not Stated
2 http://leginfo.legislature.ca.gov/faces/codes_displaySection.xhtml?lawCode=VEH§ionNum=22517
3 Not Stated
4 Not Stated
5 Not Stated
6 Not Stated
number_killed number_injured
1 0 4
2 0 1
3 0 1
4 0 1
5 0 1
6 0 1
street_view
1 https://maps.google.com/maps?q=&layer=c&cbll=37.7310277112579,-122.429020773456
2 https://maps.google.com/maps?q=&layer=c&cbll=37.785522231882,-122.46043606478
3 https://maps.google.com/maps?q=&layer=c&cbll=37.750971046041,-122.494976107694
4 https://maps.google.com/maps?q=&layer=c&cbll=37.7555838090639,-122.429578947769
5 https://maps.google.com/maps?q=&layer=c&cbll=37.7684931481079,-122.429037469419
6 https://maps.google.com/maps?q=&layer=c&cbll=37.7873546399109,-122.456963340925
dph_col_grp dph_col_grp_description party_at_fault party1_type
1 AA Vehicle(s) Only Involved NA Driver
2 CC Vehicle-Bicycle 1 Driver
3 AA Vehicle(s) Only Involved NA Driver
4 BB Vehicle-Pedestrian 1 Driver
5 FF Bicycle Only NA Bicyclist
6 AA Vehicle(s) Only Involved 1 Driver
party1_dir_of_travel party1_move_pre_acc party2_type party2_dir_of_travel
1 Not Stated Proceeding Straight
2 East Parked Bicyclist East
3 North Ran Off Road
4 East Other Unsafe Turning Parked Vehicle Not Stated
5 South Proceeding Straight
6 North Proceeding Straight Parked Vehicle North
party2_move_pre_acc point
1 POINT (-122.429020773 37.731027711)
2 Proceeding Straight POINT (-122.460436065 37.785522232)
3 POINT (-122.494976108 37.750971046)
4 Parked POINT (-122.429578948 37.755583809)
5 POINT (-122.429037469 37.768493148)
6 Parked POINT (-122.456963341 37.78735464)
data_as_of data_updated_at data_loaded_at
1 04/09/2005 12:00:00 AM 04/26/2023 12:00:00 AM 12/12/2023 12:18:08 PM
2 09/06/2005 12:00:00 AM 04/26/2023 12:00:00 AM 12/12/2023 12:18:08 PM
3 06/23/2005 12:00:00 AM 04/26/2023 12:00:00 AM 12/12/2023 12:18:08 PM
4 08/10/2005 12:00:00 AM 04/26/2023 12:00:00 AM 12/12/2023 12:18:08 PM
5 11/03/2005 12:00:00 AM 04/26/2023 12:00:00 AM 12/12/2023 12:18:08 PM
6 12/26/2005 12:00:00 AM 04/26/2023 12:00:00 AM 12/12/2023 12:18:08 PM
crash$collision_datetime <- sub("[[:space:]].*", "", crash$collision_datetime)
crash$collision_datetime <- as.Date(crash$collision_datetime, "%m/%d/%Y")
crash$date <- as.POSIXct(crash$collision_datetime, format="%Y-%m-%d",tz="UTC")
crash <- crash |>
group_by(unique_id) |>
mutate(count = n())
crash_monthly <- crash |>
group_by(date = lubridate::floor_date(date, 'month')) |>
summarize(count = sum(count))
crash_monthly <- subset(crash_monthly, date >= "2018-07-01" & date <= "2023-06-01")
crash_monthly <- rbind(crash_monthly, c("2023-06-01", 211))
monthly_data <- cbind(crash_monthly, stop_monthly)
monthly_data <- monthly_data[,-3]
colnames(monthly_data) <- c("date", "crash_count", "stop_count")
melt_data <- melt(monthly_data, id = c("date")) Warning in melt.data.frame(monthly_data, id = c("date")): partial argument
match of 'id' to 'id.vars'
#melt_data
ggplot(crash_monthly, aes(x = date, y = count)) +
geom_line() +
scale_color_manual(values = "darkred")ggplot(melt_data, aes(x = date, y = value)) +
geom_line(aes(color = variable, linetype = variable)) +
scale_color_manual(values = c("darkred", "steelblue"))Duration Time according to Demographic Information
When the police officer is stopped the person, is the ethnicity or the age affected the stop duration time? To see it, some interactive box-plots are created:
no_na_df<- stop[!is.na(stop$latitude), ]
df <- no_na_df |>
group_by(longitude,latitude) |>
summarise(cnt = n())`summarise()` has grouped output by 'longitude'. You can override using the
`.groups` argument.
out <- boxplot.stats(no_na_df$duration_of_stop)$out
out_ind <- which(no_na_df$duration_of_stop %in% c(out))
#out_ind
df2 <- no_na_df[-out_ind, ]
plot_ly(
data = df2,
x = ~perceived_race_ethnicity,
y = ~duration_of_stop,
type = "box",
color = ~perceived_race_ethnicity,
showlegend = FALSE) %>%
layout(xaxis = list(title = 'Ethnicity'),
yaxis = list(title = 'Duration time (minute)' )
)At the plot above, it can be seen that African American people have the highest duration of stop rate, but in Asians (normal and middle/south eastern) there is an interesting situation. Although their average duration is shorter from many other ethnicity but they includes many outliers in higher minutes.
plot_ly(
data = df2,
x = ~perceived_age_group,
y = ~duration_of_stop,
type = "box",
color = ~perceived_age_group,
showlegend = FALSE) %>%
layout(xaxis = list(title = 'Age Group'),
yaxis = list(title = 'Duration time (minute)' )
)Age groups are checked according to duration time. As it is expected, 60 or over people duration time is shorter than the other age groups, but under 18 age group have the highest average duration rate and the longest ranges. However, other age groups duration minute rate are almost exactly the same.
Stop Points vs Crash Points
Police stop points are generally in the same spots. In the Police Stop data set there was some geospatial values for stopping spot as longitude and latitude information. To see this spot in the map longitude and latitude values are grouped and counted as how many people stopped there. The most of the stop points had just 1 stopped. However to see the most popular stopping places having more than 100 stop places are selected. In the one place, the police officer stopped 2569 different people. To see the crashes in the map firstly longitude and latitude values created from point variable because it collected weird way as POINT (…) (What?). To create the geospatial information from this variable with the StringR packages, some string functions deleted unnecessary characters, then separated variable into two different variables as longitude and latitude.
summary(df$cnt) Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.00 3.00 13.62 10.00 2542.00
mybins <- seq(100, 2569, by=400)
mypalette <- colorBin( palette="YlOrBr", domain=quakes$mag, na.color="transparent", bins=mybins)
m <- leaflet(df) %>%
addTiles() %>%
setView( lat=37.773972, lng=-122.431297, zoom=11) %>%
addProviderTiles("Esri.WorldImagery") %>%
addCircleMarkers(~longitude, ~latitude,
fillColor = ~mypalette(cnt), fillOpacity = 0.7, color="white", radius=8, stroke=FALSE,
labelOptions = labelOptions( style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "13px", direction = "auto")
) %>%
addLegend(pal=mypalette, values=~cnt, opacity=0.5, title = "count", position = "bottomright" )Warning in mypalette(cnt): Some values were outside the color scale and will be
treated as NA
head(crash$point)[1] "POINT (-122.429020773 37.731027711)" "POINT (-122.460436065 37.785522232)"
[3] "POINT (-122.494976108 37.750971046)" "POINT (-122.429578948 37.755583809)"
[5] "POINT (-122.429037469 37.768493148)" "POINT (-122.456963341 37.78735464)"
crash$point<-gsub("POINT ","",as.character(crash$point))
crash$point<-gsub(")","",as.character(crash$point))
crash$point<-gsub("^.{0,1}","",as.character(crash$point))
df1 <- crash
df1[c('lon', 'lat')] <- str_split_fixed(crash$point, ' ', 2)
df1$lon <- as.numeric(df1$lon)
df1$lat <- as.numeric(df1$lat)
df1 <- df1 |>
group_by(lon,lat) |>
summarise(cnt = n())`summarise()` has grouped output by 'lon'. You can override using the `.groups`
argument.
summary(df1) lon lat cnt
Min. :-122.5 Min. :37.71 Min. : 1.000
1st Qu.:-122.4 1st Qu.:37.75 1st Qu.: 1.000
Median :-122.4 Median :37.77 Median : 1.000
Mean :-122.4 Mean :37.76 Mean : 1.837
3rd Qu.:-122.4 3rd Qu.:37.78 3rd Qu.: 1.000
Max. :-122.4 Max. :37.83 Max. :197.000
NA's :1 NA's :1
mybins1 <- seq(5, 197, by=30)
mypalette1 <- colorBin( palette="YlOrBr", domain=df1$cnt, na.color="transparent", bins=mybins1)
m1 <- leaflet(df1) %>%
addTiles() %>%
setView(lat=37.773972, lng=-122.431297, zoom=11) %>%
addProviderTiles("Esri.WorldImagery") %>%
addCircleMarkers(~lon, ~lat,
fillColor = ~mypalette1(cnt), fillOpacity = 0.7, color="white", radius=8, stroke=FALSE,
labelOptions = labelOptions( style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "13px", direction = "auto")
) %>%
addLegend(pal=mypalette1, values=~cnt, opacity=0.5, title = "count", position = "bottomright" )Warning in validateCoords(lng, lat, funcName): Data contains 1 rows with either
missing or invalid lat/lon values and will be ignored
Warning in mypalette1(cnt): Some values were outside the color scale and will
be treated as NA
par(mfrow=c(1,2))
mm1After mapping the longitude and latitude variables, there were no surprising results; the upper east side of the city emerged as the most popular location for both crashes and police stops. This area witnessed the highest number of crashes, suggesting that police officers may also be concentrated there to prevent accidents.
Districts and Actions
In San Francisco, there is many districts. However, in this data set, district levels are more than the number of districts. Because some of district wrote as lower case, others are upper case and some of them as N/A. Firstly, N/A’s are dropped from data and others are converted as upper case. After these steps, the district information grouped by and counted and visualized with the word cloud plot. So, the most stop occurred in Southern, Mission and Central of San Francisco. Also, action that have taken is investigated in this project, but also this variable is collected with the weird way, the police officer wrote different action that have taken in one individual with “|” as separator. For dividing the sentences, we used the seperate_delim_longer function. The separated sentences as different value but it leaves white spaces in the end and beginning of the sentences, and this white spaces deleted with str_trim function. Therefore, all the actions taken became a level, and visualized it also word cloud plot. Therefore, it can be seen as the most taken action is “Patrol car detention”.
area <- dplyr::select(stop, c("city","district","unique_identifier", "actions_taken"))
area$city <- as.factor(area$city)
area$district <- as.factor(area$district)
head(area)# A tibble: 6 × 4
# Groups: unique_identifier [6]
city district unique_identifier actions_taken
<fct> <fct> <chr> <chr>
1 SAN FRANCISCO TARAVAL U380119326E22D6100AC_5 Search of person was con…
2 SAN FRANCISCO OUT OF SF / UNK U3801223400D4887116B_1 Property was seized | Ha…
3 SAN FRANCISCO TENDERLOIN U380121054B6A404E1D6_1 Search of person was con…
4 SAN FRANCISCO BAYVIEW U380118193C7D32D472B_1 Search of person was con…
5 SAN FRANCISCO BAYVIEW U38012029453C383DF7A_1 Search of person was con…
6 SAN FRANCISCO OUT OF SF / UNK U3801210484E3C91B19E_1 Search of property was c…
sum(is.na(area$actions_taken))[1] 0
long <- area %>% separate_longer_delim(actions_taken, "|")
long$actions_taken <- str_trim(long$actions_taken, side = c("both")) #if there is a white space it will delete
long$actions_taken <- as.factor(long$actions_taken)
#levels(long$actions_taken)
#levels(long$district)
long$district <- toupper(long$district)
long$district <- as.factor(long$district)
long1 <- subset(long, district =! "#N/A" )Warning: In subset.data.frame(long, district = !"#N/A") :
extra argument 'district' will be disregarded
#levels(long1$district)
df3 <- long |>
group_by(district, actions_taken) |>
summarise(count = n()) |>
arrange(desc(count))`summarise()` has grouped output by 'district'. You can override using the
`.groups` argument.
df4 <- subset(df3, actions_taken != "None")
df3 <- df3 |>
group_by(district) |>
summarise(count=sum(count))
df4 <- df4 |>
group_by(actions_taken) |>
summarise(count=sum(count))
df3 %>% with(wordcloud(district, count, max.words = 30, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8, "Dark2")))df4 %>% with(wordcloud(actions_taken, count, max.words = 200, random.order = FALSE, rot.per = 0.15,
colors = brewer.pal(8, "Dark2")))Warning in wordcloud(actions_taken, count, max.words = 200, random.order =
FALSE, : Curbside detention could not be fit on page. It will not be plotted.
Warning in wordcloud(actions_taken, count, max.words = 200, random.order =
FALSE, : Search of person was conducted could not be fit on page. It will not
be plotted.
Warning in wordcloud(actions_taken, count, max.words = 200, random.order =
FALSE, : Handcuffed or flex cuffed could not be fit on page. It will not be
plotted.
Warning in wordcloud(actions_taken, count, max.words = 200, random.order =
FALSE, : Search of property was conducted could not be fit on page. It will not
be plotted.
References
San Francisco, California Population 2024. Worldpopulationreview. https://worldpopulationreview.com/us-cities/san-francisco-ca-population
Police Department Stop Data. DataSF. https://data.sfgov.org/Public-Safety/Police-Department-Stop-Data/ubqf-aqzw/about_data